Fix and optimize handling of vectorized memory accesses (#17767) #18095
Hey @ptrendx , Thanks for submitting the PR
CI supported jobs: [centos-gpu, unix-cpu, website, edge, centos-cpu, miscellaneous, sanity, unix-gpu, clang, windows-gpu, windows-cpu]
LGTM for inclusion in 1.7
* Vectorized loads for binary elemwise kernel
* More generalization
* Add backwardusenone
* Remove the unused _backward_add op
* Add vectorized backwardusein
* Extending vectorization to more binary ops, binary ops with scalar and unary ops
* Handling ElementwiseSum
* Get rid of half2 in mshadow
* Remove backward_elemwiseaddex
* Revert "Remove the unused _backward_add op" This reverts commit f86da86.
* Revert "Remove backward_elemwiseaddex" This reverts commit 7729114.
* Add back the backward_add since C++ test relies on it
* Test bcast implementations
* First version of vectorized bcast
* Adding single side vectorized bcast kernel
* Removing debug prints
* Actually run the single side kernel
* Move the default implementation of bcast to the vectorized one
* Limit the new implementation to GPU only
* Enabling vectorization when broadcast does not actually do broadcast
* Cleaning
* Cleaning part 2
* Fix for numpy ops using stuff from broadcast
* Fix
* Fix lint
* Try to debug pinv numpy test
* Fix
* Fix the vectorized broadcast implementation for misaligned input pointers
* Added tests
* Added docs to cuda_vectorization.cuh
* Another fix for broadcast and fix INT64 compilation
* Optimize for aligned=true
* 1 more addition to test
* Reverting the change to Numpy op test
* Trying mcmodel=medium to fix the failure in CMake static build
* Revert "Trying mcmodel=medium to fix the failure in CMake static build" This reverts commit 1af684c.
* Limiting the PR to just elementwise ops
ad91c46 to c5893e3
Adding this PR to 1.7.0 roadmap #16864.
@mxnet-bot run ci [windows-gpu]
Jenkins CI successfully triggered : [windows-gpu]
@ciyongch This PR is now merged to v1.x branch, please cherry pick it to v1.7.
@ptrendx as the v1.7.x is already rebased, could you please submit it to v1.7.x, then we can get it merged once the CI pass? Thanks.
Vectorized loads for binary elemwise kernel
More generalization
Add backwardusenone
Remove the unused _backward_add op
Add vectorized backwardusein
Extending vectorization to more binary ops, binary ops with scalar and unary ops
Handling ElementwiseSum
Get rid of half2 in mshadow
Remove backward_elemwiseaddex
Revert "Remove the unused _backward_add op"
This reverts commit f86da86.
Revert "Remove backward_elemwiseaddex"
This reverts commit 7729114.
Add back the backward_add since C++ test relies on it
Test bcast implementations
First version of vectorized bcast
Adding single side vectorized bcast kernel
Removing debug prints
Actually run the single side kernel
Move the default implementation of bcast to the vectorized one
Limit the new implementation to GPU only
Enabling vectorization when broadcast does not actually do broadcast
Cleaning
Cleaning part 2
Fix for numpy ops using stuff from broadcast
Fix
Fix lint
Try to debug pinv numpy test
Fix
Fix the vectorized broadcast implementation for misaligned input pointers
Added tests
Added docs to cuda_vectorization.cuh
Another fix for broadcast and fix INT64 compilation
Optimize for aligned=true
1 more addition to test
Reverting the change to Numpy op test
Trying mcmodel=medium to fix the failure in CMake static build
Revert "Trying mcmodel=medium to fix the failure in CMake static build"
This reverts commit 1af684c.
Description
Cherry pick #17767 to v1.x branch. @ciyongch
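For readers unfamiliar with the technique named in the title, the following is a minimal host-side C++ sketch of vectorized memory access for an elementwise add, including the misaligned-pointer case that several commits above mention. It is not the code from this PR: `Vec4`, `MisalignedElems`, `AddVectorized`, the 16-byte access width, and the policy of falling back to scalar accesses when the pointers have different offsets are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical names throughout; this is an illustration, not the PR's code.
constexpr std::size_t kVecBytes = 16;                      // assumed width of one wide access
constexpr std::size_t kLanes = kVecBytes / sizeof(float);  // 4 floats per wide access

struct Vec4 { float lane[kLanes]; };

// Number of leading elements before `p` reaches a kVecBytes boundary.
inline std::size_t MisalignedElems(const void* p) {
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
  assert(addr % sizeof(float) == 0);  // pointer must at least be element-aligned
  const std::size_t off = addr % kVecBytes;
  return off == 0 ? 0 : (kVecBytes - off) / sizeof(float);
}

// Computes out[i] = a[i] + b[i] for i in [0, n).
// Returns how many elements were processed through the wide path.
std::size_t AddVectorized(const float* a, const float* b, float* out, std::size_t n) {
  const std::size_t head_a = MisalignedElems(a);

  // Assumed policy: use wide accesses only when all pointers share one offset.
  if (head_a != MisalignedElems(b) || head_a != MisalignedElems(out)) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    return 0;
  }

  // Scalar head: advance until the pointers sit on a kVecBytes boundary.
  const std::size_t head = head_a < n ? head_a : n;
  for (std::size_t i = 0; i < head; ++i) out[i] = a[i] + b[i];

  // Wide body: move kLanes elements per load and per store.
  std::size_t i = head;
  std::size_t wide = 0;
  for (; i + kLanes <= n; i += kLanes) {
    Vec4 va, vb, vo;
    std::memcpy(&va, a + i, sizeof(Vec4));
    std::memcpy(&vb, b + i, sizeof(Vec4));
    for (std::size_t k = 0; k < kLanes; ++k) vo.lane[k] = va.lane[k] + vb.lane[k];
    std::memcpy(out + i, &vo, sizeof(Vec4));
    wide += kLanes;
  }

  // Scalar tail: fewer than kLanes elements remain.
  for (; i < n; ++i) out[i] = a[i] + b[i];
  return wide;
}
```

In a CUDA kernel the `memcpy` calls would typically become loads and stores through a pointer to a 16-byte type such as `float4`, which require the address to be 16-byte aligned; that requirement is why the head elements are handled separately here.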